October 24, 2016

What does my group do?

  • Study the molecular basis of variation in development and disease
  • Using high-throughput experimental methods

Software

  • State-of-the-art computational and statistical analysis platform
  • We develop and apply methods for these analyses in this platform
  • Our collaborators do analysis in this platform with us
  • metagenomeSeq
  • metagenomeFeatures
  • antiProfiles
  • minfi
  • bumphunter
  • HTShape
  • qsmooth
  • Rcplex
  • Rcsdp

Collaborative and exploratory analysis

  • Data transformation and modeling: data smoothing, region finding (R/Bioconductor: BSmooth, minfi)
  • Exploration: search by gene, search by overlap
  • Contextual analysis: overlap with other data (our own, other labs, UCSC, Ensembl)

Genomic Data Science

  • We have an unprecedented ability to measure
  • and lots of publicly available data to contextualize it
[H. Wickham]

Integrative, visual and computational exploratory analysis of genomic data

  • Browser-based
  • Interactive
  • Integration of data
  • Reproducible dissemination
  • Communication with R/Bioconductor: epivizr package
e.g.: http://epiviz.cbcb.umd.edu/?ws=YOsu0RmUc9l
[Nat. Methods, 2014]

Creativity in exploration

We are building software applications to support creative exploratory analysis of large genome-wide datasets…

[T. Speed]

Summarization: summarize integrated measurements (computed on data subsets)

Statistically-guided exploration: Calculate a statistic of interest

# Get tumor methylation base-pair data
m <- assay(se)[,"tumor"]

# Compute regions with highest variability across CpGs
region_stat <- calcWindowStat(m, step=25, window=80, stat=rowSds)
s <- region_stat[,"stat"]
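
The windowed statistic can be illustrated in base R. This is a minimal sketch, not the calcWindowStat used in the snippet: the helper name window_stat, its arguments, and the simulated methylation values are all assumptions made for illustration.

```r
# Hypothetical stand-in for a windowed statistic (not the calcWindowStat above).
# x: numeric vector of per-CpG values; window: CpGs per window;
# step: CpGs between consecutive window starts; stat: summary function.
window_stat <- function(x, window, step, stat = sd) {
  stopifnot(length(x) >= window)
  starts <- seq(1, length(x) - window + 1, by = step)
  data.frame(
    indices = starts,
    stat = vapply(starts, function(i) stat(x[i:(i + window - 1)]), numeric(1))
  )
}

# Simulated values: a stable stretch followed by a highly variable stretch.
set.seed(1)
m <- c(rnorm(200, mean = 0.5, sd = 0.02), rnorm(100, mean = 0.5, sd = 0.2))

region_stat <- window_stat(m, window = 80, step = 25)
s <- region_stat[, "stat"]
```

Windows that fall entirely in the variable stretch get a much larger standard deviation than those in the stable stretch, which is what makes the statistic usable as a guide for exploration.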

Explore data based on statistic

What's around the regions with the highest variability across CpGs?

# get locations in decreasing order
o <- order(s, decreasing=TRUE)
indices <- region_stat[o, "indices"]
slideShowRegions <- rowRanges(se)[indices] + 1250000L
mgr$slideshow(slideShowRegions)
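
The rank-then-widen step can be sketched without any genomics packages. The table below is toy data, and the column names start, end and stat are assumptions. The padding is also much smaller than the 1250000 used in the snippet, purely to keep the numbers readable.

```r
# Toy window table; coordinates and statistics are made up for illustration.
regions <- data.frame(
  start = c(1000, 5000, 9000, 13000),
  end   = c(1500, 5500, 9500, 13500),
  stat  = c(0.02, 0.31, 0.11, 0.27)
)
pad <- 1250L

# Rank regions from most to least variable.
o <- order(regions$stat, decreasing = TRUE)
top <- regions[o, ]

# Widen each region on both sides to show its surrounding context,
# clamping the start so it never drops below the first base.
top$start <- pmax(1, top$start - pad)
top$end <- top$end + pad
top
```

Each row of top then plays the role of one stop in the slideshow: the most variable region first, shown with enough flanking sequence to see what surrounds it.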

Dynamically extensible: easily integrate new data types and add new visualizations.

  • Based on classic "three-table" design in genomic data analysis
  • Data providers define coordinate space
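
As an illustration of the "three-table" idea, the sketch below assumes the three tables are a measurement matrix, a feature table that carries the coordinate space, and a sample table. All object and column names here are hypothetical.

```r
# Table 1: measurements, features in rows and samples in columns.
counts <- matrix(
  c(5, 0, 3,
    2, 7, 0),
  nrow = 3,
  dimnames = list(c("f1", "f2", "f3"), c("s1", "s2"))
)

# Table 2: feature annotation, one row per feature (the coordinate space).
features <- data.frame(
  chr = c("chr1", "chr1", "chr2"),
  start = c(100, 500, 200),
  row.names = rownames(counts)
)

# Table 3: sample annotation, one row per sample.
samples <- data.frame(
  condition = c("normal", "tumor"),
  row.names = colnames(counts)
)

# The design relies on the three tables staying aligned.
stopifnot(identical(rownames(counts), rownames(features)),
          identical(colnames(counts), rownames(samples)))

# A query on the annotation tables selects a block of measurements.
sub <- counts[features$chr == "chr1", samples$condition == "tumor", drop = FALSE]
sub
```

Because features and samples are described separately from the measurements, a new data type only needs to supply its own feature table to fit the same pattern.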

Visualization design goals

  • Context
  • Integrate and align multiple data sources; navigate; search
  • Connect: brushing
  • Encode: map visualization properties to data on the fly
  • Reconfigure: multiple views of the same data
[Perer & Shneiderman]

Visualization goals

  • Data
  • Select and filter: tight-knit integration with R/Bioconductor;
  • (current work) filters on visualization propagate to data environment
  • Model
  • New 'measurements' that result from modeling, perhaps suggested by data context
[Perer & Shneiderman]

[H. Wickham]

One interpretation of Big Data: many relevant sources of contextual data

  • Easily access/integrate contextual data
  • Driven by exploratory analysis of immediate data
  • Iterative process
  • Visual and computational exploration go hand in hand

Metagenomics (mixed genomes)

  • Discoveries: pathogenic associations for childhood diarrhea in developing world. (Genome Biology, 2014)
  • Methods: association discovery for metagenomic communities. (Nature Methods, 2013)
  • Tools: metagenomeSeq, metagenomeFeatures, metaviz
[Human Microbiome Project]

Metagenomics (mixed genomes)

What is the measurement?

Samples:

Features:

Challenges for epidemiological metagenomic studies

  • Analysis units (features) unknown a priori
  • High levels of sparsity
  • Standard normalization methods don't work well
  • Confounders, e.g. study site, country
  • Large number of features (Type I error control)
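
Two of these challenges can be made concrete with simulated data. The sketch below is illustrative only: the count matrix and p-values are randomly generated, and Bonferroni adjustment is just one standard way to control Type I error across many features, not necessarily the one used in the study.

```r
set.seed(42)

# Simulated count matrix: 200 features by 10 samples, overdispersed counts.
counts <- matrix(rnbinom(200 * 10, mu = 2, size = 0.1), nrow = 200)

# Sparsity: the fraction of entries that are exactly zero.
sparsity <- mean(counts == 0)

# One null p-value per feature (uniform, i.e. no real association).
p <- runif(200)

# Family-wise Type I error control across all features.
p_bonf <- p.adjust(p, method = "bonferroni")

# Unadjusted testing flags some null features by chance alone;
# the adjusted p-values are never smaller than the raw ones.
n_raw <- sum(p < 0.05)
n_adj <- sum(p_bonf < 0.05)
```

With many features and mostly zero counts, the raw 0.05 threshold produces false positives at a rate that grows with the number of features, which is why the adjustment matters here.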

Normalization
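
One alternative to total-sum scaling is cumulative-sum scaling (CSS), the approach implemented in metagenomeSeq. The sketch below is a simplified illustration and not the package's implementation: it assumes a fixed quantile rather than a data-driven one, assumes every sample has at least one nonzero count, and the function name css_normalize is hypothetical.

```r
# Simplified cumulative-sum scaling: scale each sample by the sum of its
# nonzero counts at or below a chosen quantile, instead of by its total.
css_normalize <- function(counts, q = 0.5, scale = 1000) {
  factors <- apply(counts, 2, function(x) {
    nz <- x[x > 0]
    cutoff <- quantile(nz, probs = q)
    sum(nz[nz <= cutoff])
  })
  sweep(counts, 2, factors / scale, "/")
}

# Two samples: the second was sequenced twice as deeply, and one dominant
# feature (f4) behaves very differently between them.
counts <- matrix(
  c(1, 2, 0, 100,
    2, 4, 0, 10),
  nrow = 4,
  dimnames = list(paste0("f", 1:4), c("s1", "s2"))
)

norm <- css_normalize(counts)
norm
```

In this toy example the dominant feature is excluded from the scaling factors, so the low-abundance features f1 and f2 come out equal across the two samples, whereas dividing by the column totals would let f4 distort them.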

Zero-inflation

MetagenomeSeq

Summary

  • Diarrheal study consisting of ~1000 samples (now ~3000).
  • Interesting microbiome differences across four countries and across ages
  • Novel normalization and differential abundance testing framework for marker-gene surveys

Hierarchically organized features

Defining the measurement unit of analysis

Not just features, but samples may be hierarchically organized

Challenges

  • Underlying idea: cut in the tree defines unit of analysis
  • Visual design challenges: how to support effective exploration of cuts (persistence, consistency, density)
  • Data challenges: how to support efficient exploration of cuts (graph database backend for contextual data)
  • Engineering challenges: when do analysis patterns become interaction modes
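
The idea that a cut in the tree defines the unit of analysis can be sketched by aggregating counts at a chosen taxonomic level. The taxonomy table, counts, and the helper name aggregate_at below are toy assumptions for illustration, not part of the tools described here.

```r
# Toy taxonomy: each feature (OTU) has a label at every level of the tree.
taxonomy <- data.frame(
  phylum = c("Firmicutes", "Firmicutes", "Bacteroidetes", "Bacteroidetes"),
  genus  = c("Lactobacillus", "Streptococcus", "Bacteroides", "Prevotella"),
  row.names = paste0("otu", 1:4)
)

# Toy counts: features in rows, samples in columns.
counts <- matrix(
  c(10, 0, 5, 1,
     3, 4, 0, 8),
  nrow = 4,
  dimnames = list(rownames(taxonomy), c("s1", "s2"))
)

# Cutting the tree at a level means summing counts over the features
# that share a label at that level.
aggregate_at <- function(counts, taxonomy, level) {
  rowsum(counts, group = taxonomy[rownames(counts), level])
}

by_phylum <- aggregate_at(counts, taxonomy, "phylum")
by_genus <- aggregate_at(counts, taxonomy, "genus")
by_phylum
```

Moving the cut up or down the tree changes the number of rows, and therefore the features that any downstream statistical test is run on.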

Summary

  • Systems for interactive (and creative) data exploration and analysis of epigenomic and metagenomic data
  • Design exploration of hierarchical domains with statistical analysis as the ultimate goal
  • Collaborative, reproducible, with close connection with R/Bioconductor (metagenomeSeq, metagenomeFeatures, metavizr)

Acknowledgements

Justin Wagner, Jayaram Kancherla (CBCB)
Florin Chelaru (now at Twinfog), Joseph Paulson (now at Harvard)
Mihai Pop (CBCB)
Feinberg Lab & K. Hansen (JHU), R. Irizarry (Harvard)
HMP2 Project (Xavier and Huttenhower, Harvard)

Funding: NIH, Genentech, Gates Foundation

More information

http://hcbravo.org
@hcorrada